
Conversation

@wooway777 (Collaborator)

resolves #900
resolves #846
includes #859

Iluvatar (天数): [test result screenshots]

MetaX (沐曦): [test result screenshot]

Moore Threads (摩尔): [test result screenshot]

…graph recording

- Ensure embedding tensors are on the same device. Change format.
- Optimize embedding kernel with vectorized memory access and __ldg
- Add vectorized memory access using float4/float2, half2, and bfloat162
- Use __ldg instruction for read-only weight and indices access
- Add memory alignment checks to enable vectorized paths
- Add __restrict__ keywords for better compiler optimization
- Implement dynamic block size selection based on embedding_dim (a sketch of these optimizations follows below)
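
A minimal CUDA sketch of the techniques these commits describe, assuming float32 weights and a one-block-per-token launch; the kernel and launcher names are hypothetical stand-ins, and the real implementation also carries half2 and bfloat162 paths plus explicit alignment checks:

```cuda
// Hypothetical sketch, not the PR's actual code: one block per token,
// threads stride across the row, float4 fast path with scalar fallback.
#include <cuda_runtime.h>
#include <cstdint>

__global__ void embeddingGatherKernel(float *__restrict__ out,
                                      const float *__restrict__ weight,
                                      const int64_t *__restrict__ indices,
                                      int embedding_dim) {
    const int64_t row = __ldg(&indices[blockIdx.x]); // read-only index load
    const float *src = weight + row * embedding_dim;
    float *dst = out + static_cast<int64_t>(blockIdx.x) * embedding_dim;

    // float4 path: valid when the dim is a multiple of 4 and rows stay
    // 16-byte aligned (cudaMalloc'd weights with dim % 4 == 0 qualify).
    if (embedding_dim % 4 == 0) {
        const float4 *src4 = reinterpret_cast<const float4 *>(src);
        float4 *dst4 = reinterpret_cast<float4 *>(dst);
        for (int i = threadIdx.x; i < embedding_dim / 4; i += blockDim.x) {
            dst4[i] = __ldg(&src4[i]); // one 16-byte read-only transaction
        }
    } else {
        for (int i = threadIdx.x; i < embedding_dim; i += blockDim.x) {
            dst[i] = __ldg(&src[i]); // scalar fallback
        }
    }
}

// Dynamic block size keyed on embedding_dim, echoing the last bullet above.
void launchEmbeddingGather(float *out, const float *weight,
                           const int64_t *indices, int num_indices,
                           int embedding_dim, cudaStream_t stream) {
    int block = 256;                              // default
    if (embedding_dim <= 64)        block = 64;   // tiny rows: fewer idle threads
    else if (embedding_dim >= 1024) block = 128;  // large rows: less register pressure
    embeddingGatherKernel<<<num_indices, block, 0, stream>>>(
        out, weight, indices, embedding_dim);
}
```

The `__ldg` loads route the read-only weight and index accesses through the read-only data cache, and the `float4` path issues one 16-byte transaction in place of four 4-byte ones.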

Copilot AI left a comment


Pull request overview

This pull request adds device-side embedding support for CUDA and CUDA-like platforms (NVIDIA, Metax, Moore) to enable CUDA Graph recording. The key improvement is removing synchronous CPU transfers that previously prevented graph recording, replacing them with fully asynchronous device-side kernel implementations.

Key changes:

  • Implements device-side embedding kernels for NVIDIA, Metax, and Moore platforms with optimized vectorized memory access
  • Removes synchronous to(cpu_device) operations from test and production code (see the capture sketch after this list)
  • Adds comprehensive test suite for validating CUDA Graph recording support
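
Why the synchronous copies had to go: stream capture records work without executing it, so any call that forces host/device synchronization (a blocking cudaMemcpy, a to(cpu_device) round-trip) invalidates the capture, while plain asynchronous kernel launches record cleanly. A self-contained sketch, with a hypothetical kernel and the CUDA 12 `cudaGraphInstantiate` signature:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void dummyEmbeddingKernel(float *out) { out[threadIdx.x] = 1.0f; }

int main() {
    float *buf;
    cudaMalloc(&buf, 256 * sizeof(float));
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    cudaGraph_t graph;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    // Legal during capture: asynchronous, device-side work only.
    dummyEmbeddingKernel<<<1, 256, 0, stream>>>(buf);
    // A synchronous cudaMemcpy(..., cudaMemcpyDeviceToHost) here would
    // invalidate the capture and cudaStreamEndCapture would fail.
    cudaStreamEndCapture(stream, &graph);

    cudaGraphExec_t exec;
    cudaGraphInstantiate(&exec, graph, 0); // CUDA 12 signature
    cudaGraphLaunch(exec, stream);
    cudaStreamSynchronize(stream);
    printf("graph captured and replayed\n");

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(buf);
    return 0;
}
```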

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 8 comments.

Show a summary per file:

- test/infinicore/ops/embedding.py: Removed CPU conversion logic; now supports device-side input directly
- test/infinicore/nn/embedding.py: Removed CPU conversion logic for testing
- test/infinicore/nn/test_embedding_graph_recording.py: New comprehensive test suite for CUDA Graph recording validation
- test/infinicore/nn/HOW_TO_USE_GRAPH_RECORDING_TEST.md: Documentation on how to use and interpret the graph recording tests
- test/infinicore/nn/EMBEDDING_GRAPH_RECORDING_COMPARISON.md: Technical comparison of before/after implementation details
- src/infiniop/ops/embedding/operator.cc: Multi-platform dispatcher for embedding operations
- src/infiniop/ops/embedding/embedding.h: Common descriptor macro definition
- src/infiniop/ops/embedding/cpu/embedding_cpu.h: CPU implementation header
- src/infiniop/ops/embedding/cpu/embedding_cpu.cc: CPU implementation with memcpy-based embedding lookup
- src/infiniop/ops/embedding/cuda/embedding_kernel.cuh: Shared CUDA kernel helper functions with vectorized memory access
- src/infiniop/ops/embedding/nvidia/embedding_nvidia.cuh: NVIDIA platform header
- src/infiniop/ops/embedding/nvidia/embedding_nvidia.cu: NVIDIA CUDA kernel implementation with float4/half2/bfloat162 optimization
- src/infiniop/ops/embedding/metax/embedding_metax.cuh: Metax platform header
- src/infiniop/ops/embedding/metax/embedding_metax.maca: Metax MACA kernel implementation
- src/infiniop/ops/embedding/moore/embedding_moore.h: Moore platform header
- src/infiniop/ops/embedding/moore/embedding_moore_kernel.h: Moore-specific kernel helpers
- src/infiniop/ops/embedding/moore/embedding_moore.mu: Moore MUSA kernel implementation
- src/infinicore/ops/embedding/embedding_infiniop.cc: InfiniOP wrapper with descriptor caching
- src/infinicore/ops/embedding/embedding.cc: Op dispatcher and device synchronization logic
- src/infinicore/nn/embedding.cc: Simplified forward method using new device-side ops
- python/infinicore/nn/functional/embedding.py: Removed CPU-only assertion to support device inputs
- include/infiniop/ops/embedding.h: Public API for embedding operations
- include/infiniop.h: Added embedding.h include
- include/infinicore/ops/embedding.hpp: Added Embedding class with dispatcher pattern
- include/infinicore/ops.hpp: Added embedding.hpp include


Comment on lines +48 to +54
} else if constexpr (std::is_same_v<T, cuda_bfloat16>) {
    // Use bfloat162 for vectorized access
    if (embedding_dim >= 2 && embedding_dim % 2 == 0) {
        copyVectorizedBFloat162<IndexType>(dst, src, embedding_dim);
    } else {
        copyScalar<T, IndexType>(dst, src, embedding_dim);
    }

Copilot AI Jan 9, 2026


The embeddingKernel template checks for the cuda_bfloat16 type on line 48, but this is Metax platform code, which should use __hpcc_bfloat16 instead. Because of this type mismatch, the vectorized bfloat16 path (lines 48-54) will never execute on the Metax platform. The kernel should check for the Metax-specific type __hpcc_bfloat16 to enable proper vectorization.
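
A sketch of the shape of the fix, compilable under nvcc; `__MACA__` and `__hpcc_bfloat16`/`__hpcc_bfloat162` are assumptions about the Metax toolchain taken from the comment, and the helper name is hypothetical:

```cuda
#include <cuda_bf16.h>
#include <type_traits>

// The review's point: dispatch on the platform's own bf16 type. On NVIDIA
// that is __nv_bfloat16; on Metax it would be __hpcc_bfloat16 (assumed
// here, guarded by an assumed __MACA__ toolchain macro).
#if defined(__MACA__)
using platform_bf16 = __hpcc_bfloat16;
using platform_bf16x2 = __hpcc_bfloat162;
#else
using platform_bf16 = __nv_bfloat16;
using platform_bf16x2 = __nv_bfloat162;
#endif

template <typename T>
__device__ void copyRow(T *__restrict__ dst, const T *__restrict__ src,
                        int embedding_dim) {
    if constexpr (std::is_same_v<T, platform_bf16>) {
        if (embedding_dim % 2 == 0) { // vectorized 2-wide path
            auto *d2 = reinterpret_cast<platform_bf16x2 *>(dst);
            auto *s2 = reinterpret_cast<const platform_bf16x2 *>(src);
            for (int i = threadIdx.x; i < embedding_dim / 2; i += blockDim.x)
                d2[i] = s2[i];
            return;
        }
    }
    for (int i = threadIdx.x; i < embedding_dim; i += blockDim.x) // scalar fallback
        dst[i] = src[i];
}
```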

Comment on lines +88 to +146
} else if (_embedding_dim >= 1024) {
    block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure
} else if (_embedding_dim <= 256) {
    block_size = 384; // Medium embedding_dim: balanced configuration

Copilot AI Jan 9, 2026


The block size adjustment logic has an inconsistency. When _embedding_dim <= 256, the code sets block_size = 384 (line 91), but this branch is only reached when _embedding_dim is greater than 64 (line 87 already handles <= 64), so embedding_dim values between 65 and 256 get block_size = 384. That contradicts the default of 256 that was already set. The logic should be restructured so the conditions no longer overlap and each range selects a single block size.

Suggested change:

- } else if (_embedding_dim >= 1024) {
-     block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure
- } else if (_embedding_dim <= 256) {
-     block_size = 384; // Medium embedding_dim: balanced configuration
+ } else if (_embedding_dim <= 256) {
+     block_size = 384; // Medium embedding_dim: balanced configuration
+ } else if (_embedding_dim >= 1024) {
+     block_size = 128; // Large embedding_dim: use smaller block to reduce register pressure
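
Read as a whole, the selection can be restructured as a single non-overlapping ladder, ordered so each range is hit exactly once. A sketch: the function name is hypothetical, the thresholds come from the quoted code, and the value for the elided <= 64 branch (referenced on line 87 but not shown in the excerpt) is a placeholder:

```cuda
// Hypothetical helper; thresholds from the quoted code, <= 64 value assumed.
static int selectBlockSize(int embedding_dim) {
    if (embedding_dim <= 64)  return 64;   // placeholder for the elided small-dim branch
    if (embedding_dim <= 256) return 384;  // medium embedding_dim: balanced configuration
    if (embedding_dim < 1024) return 256;  // former default
    return 128;                            // large embedding_dim: reduce register pressure
}
```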

print("⚠ CUDA 不可用,跳过图录制测试")
return False

device = infinicore.device("cuda", 0)

Copilot AI Jan 9, 2026

print("⚠ CUDA 不可用,跳过测试")
return False

device = infinicore.device("cuda", 0)

Copilot AI Jan 9, 2026

@wooway777 force-pushed the issue/900-temp branch 2 times, most recently from 98a0be1 to 7997d89 on January 9, 2026 02:40


Development

Successfully merging this pull request may close these issues.

[DEV] CUDA-like embedding (类CUDA embedding)
[DEV] Add an embedding operator that supports graph recording (增加支持图录制的embedding算子)
